Search CORE

15 research outputs found

Joint Entity Extraction and Assertion Detection for Clinical Text

Author: Bhatia Parminder
Celikkaya Busra
Khalilia Mohammed
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2019

Negative medical findings are prevalent in clinical reports, yet discriminating them from positive findings remains a challenging task for information extraction. Most existing systems treat this task as a pipeline of two separate tasks, i.e., named entity recognition (NER) and rule-based negation detection. We consider this as a multi-task problem and present a novel end-to-end neural model to jointly extract entities and negations. We extend a standard hierarchical encoder-decoder NER model and first adopt a shared encoder followed by separate decoders for the two tasks. This architecture performs considerably better than the previous rule-based and machine learning-based systems. To overcome the problem of increased parameter size, especially for low-resource settings, we propose the Conditional Softmax Shared Decoder architecture, which achieves state-of-the-art results for NER and negation detection on the 2010 i2b2/VA challenge dataset and a proprietary de-identified clinical dataset. Comment: Accepted at the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019)

arXiv.org e-Print Archive

Crossref

Relational data clustering algorithms with biomedical applications

Author: Khalilia Mohammed A.
Publication venue: University of Missouri--Columbia

University of Missouri: MOspace

SALMA: Arabic Sense-Annotated Corpus and WSD Benchmarks

Author: Hammouda Tymaa
Jarrar Mustafa
Khalilia Mohammed
Malaysha Sanad
Publication date: 29/10/2023

SALMA, the first Arabic sense-annotated corpus, consists of ~34K tokens, which are all sense-annotated. The corpus is annotated using two different sense inventories simultaneously (Modern and Ghani). SALMA's novelty lies in how tokens and senses are associated. Instead of linking a token to only one intended sense, SALMA links a token to multiple senses and provides a score for each sense. A smart web-based annotation tool was developed to support scoring multiple senses against a given word. In addition to sense annotations, we also annotated the corpus using six types of named entities. The quality of our annotations was assessed using various metrics (Kappa, Linear Weighted Kappa, Quadratic Weighted Kappa, Mean Average Error, and Root Mean Square Error), which show very high inter-annotator agreement. To establish a Word Sense Disambiguation baseline using our SALMA corpus, we developed an end-to-end Word Sense Disambiguation system using Target Sense Verification. We used this system to evaluate three Target Sense Verification models available in the literature. Our best model achieved an accuracy of 84.2% using Modern and 78.7% using Ghani. The full corpus and the annotation tool are open-source and publicly available at https://sina.birzeit.edu/salma/

arXiv.org e-Print Archive

ArBanking77: Intent Detection Neural Model and a New Dataset in Modern and Dialectical Arabic

Author: Birim Ahmet
Erden Mustafa
Ghanem Sana
Jarrar Mustafa
Khalilia Mohammed
Publication date: 29/10/2023

This paper presents ArBanking77, a large Arabic dataset for intent detection in the banking domain. Our dataset was arabized and localized from the original English Banking77 dataset, which consists of 13,083 queries, to produce the ArBanking77 dataset with 31,404 queries in both Modern Standard Arabic (MSA) and Palestinian dialect, with each query classified into one of the 77 classes (intents). Furthermore, we present a neural model, based on AraBERT, fine-tuned on ArBanking77, which achieved an F1-score of 0.9209 and 0.8995 on MSA and Palestinian dialect, respectively. We performed extensive experimentation in which we simulated low-resource settings, where the model is trained on a subset of the data and augmented with noisy queries to simulate colloquial terms, mistakes and misspellings found in real NLP systems, especially live chat queries. The data and the models are publicly available at https://sina.birzeit.edu/arbanking77

arXiv.org e-Print Archive

WojoodNER 2023: The First Arabic Named Entity Recognition Shared Task

Author: Abdul-Mageed Muhammad
Elmadany AbdelRahim
Hamad Nagham
Jarrar Mustafa
Khalilia Mohammed
Omar Alaa'
Talafha Bashar
Publication date: 24/10/2023

We present WojoodNER-2023, the first Arabic Named Entity Recognition (NER) Shared Task. The primary focus of WojoodNER-2023 is on Arabic NER, offering novel NER datasets (i.e., Wojood) and the definition of subtasks designed to facilitate meaningful comparisons between different NER approaches. WojoodNER-2023 encompassed two Subtasks: FlatNER and NestedNER. A total of 45 unique teams registered for this shared task, with 11 of them actively participating in the test phase. Specifically, 11 teams participated in FlatNER, while 8 teams tackled NestedNER. The winning teams achieved F1 scores of 91.96 and 93.73 in FlatNER and NestedNER, respectively.

arXiv.org e-Print Archive

Predicting disease risks from highly imbalanced data using random forest

Author: AP Bradley
C Chen
D Palmer
DA Davis
DH Mantzaris
E Cohen
F Provost
HCUP Project
J Mingers
JR Quinlan
L Breiman
M Skubic
Mihail Popescu
Mohammed Khalilia
N Japkowicz
P Hebert
Sounak Chakraborty
ST Moturu
T Hastie
T Yi
V Fuster
W Yu
W Zhang
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: We present a method utilizing the Healthcare Cost and Utilization Project (HCUP) dataset for predicting disease risk of individuals based on their medical diagnosis history. The presented methodology may be incorporated in a variety of applications such as risk management, tailored health communication and decision support systems in healthcare. Methods: We employed the National Inpatient Sample (NIS) data, which is publicly available through the Healthcare Cost and Utilization Project (HCUP), to train random forest classifiers for disease prediction. Since the HCUP data is highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples, while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting and RF to predict the risk of eight chronic diseases. Results: We predicted eight disease categories. Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process. Conclusions: In combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP data set, we predicted eight disease categories with an average AUC of 88.79%.
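The repeated random sub-sampling step described in this abstract could be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say how the sub-samples are drawn, so the sketch assumes binary labels, keeps every minority-class example in each sub-sample, and draws an equal number of majority-class examples at random. The function name `balanced_subsamples` and its parameters are hypothetical, and the classifier training itself is omitted.

```python
import random


def balanced_subsamples(labels, n_subsamples, seed=0):
    """Build index lists that each contain equally many 0s and 1s.

    Assumption (not from the abstract): every sub-sample reuses all
    minority-class indices and adds a fresh random draw of the same
    size from the majority class.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if len(positives) <= len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives

    subsamples = []
    for _ in range(n_subsamples):
        drawn = rng.sample(majority, len(minority))  # sampling without replacement
        subsamples.append(sorted(minority + drawn))
    return subsamples


# Toy imbalanced label vector: 5 positive cases, 95 negative cases.
labels = [1] * 5 + [0] * 95
subsamples = balanced_subsamples(labels, n_subsamples=4)
```

In an ensemble setting, one classifier would presumably be trained per sub-sample and their predictions combined; that part is left out here because the abstract does not describe it.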

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

PDA project [abstract]

Author: Khalilia Mohammed
Publication venue: University of Missouri--Columbia. Office of Undergraduate Research
Publication date: 01/01/2004

Faculty Mentor: Dr. Marjorie Skubic, Computer Engineering and Computer Science. Abstract only available. The goal of this project is to create a robot interface that allows a user to guide and control a robot to perform some task. The assumption is that, although the user may be a domain expert in how the task should be done, he is not an expert in robotics. During the actual robot use, he should focus on the task to be done rather than worrying about the robot or the interaction modality. To address this goal, we have been investigating the use of hand-drawn route maps, in which the user sketches an approximate representation of the environment and then sketches the desired robot trajectory with respect to that environment. The objective in the sketch interface is to extract spatial information about the map and a qualitative path through the landmarks drawn on the sketch. This information is used to build a task representation for the robot, which operates as a semiautonomous vehicle. The stylus interface of the PDA allows the user to sketch a map much as you would on paper. The PDA captures the string of (x,y) coordinates sketched on the screen, which forms a digital representation suitable for processing. The user first draws a representation of the environment by sketching the approximate boundary of each object. During the sketching process, a delimiter is included to separate the string of coordinates for each object in the environment. After all of the environment objects have been drawn, another delimiter is included to indicate the start of the robot trajectory, and the user sketches the desired path of the robot, relative to the sketched environment.
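The delimiter scheme described in this abstract can be illustrated with a small parser. The abstract does not say how the delimiters are encoded, so the marker values `OBJECT_END` and `PATH_START`, the function name `parse_sketch`, and the sample coordinates are all hypothetical; this is a sketch of the idea, not the project's actual format.

```python
# Hypothetical delimiter tokens; the real encoding is not given in the abstract.
OBJECT_END = "OBJECT_END"
PATH_START = "PATH_START"


def parse_sketch(stream):
    """Split a captured stream into object boundaries and a robot path.

    `stream` is a sequence of (x, y) tuples interleaved with delimiter
    tokens. Returns (objects, path), where `objects` is a list of
    coordinate lists and `path` is a single coordinate list.
    """
    objects, current, path = [], [], []
    in_path = False
    for item in stream:
        if item == PATH_START:
            if current:  # close an object that was not explicitly ended
                objects.append(current)
                current = []
            in_path = True
        elif item == OBJECT_END:
            if current:
                objects.append(current)
                current = []
        elif in_path:
            path.append(item)
        else:
            current.append(item)
    return objects, path


stream = [
    (0, 0), (1, 0), (1, 1), OBJECT_END,   # first sketched object
    (5, 5), (6, 5), OBJECT_END,           # second sketched object
    PATH_START, (0, 2), (3, 3),           # desired robot trajectory
]
objects, path = parse_sketch(stream)
```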

University of Missouri: MOspace

Wojood: Nested Arabic Named Entity Corpus and Recognition using BERT

Author: Ghanem Sana
Jarrar Mustafa
Khalilia Mohammed
Publication date: 23/05/2022

This paper presents Wojood, a corpus for Arabic nested Named Entity Recognition (NER). Nested entities occur when one entity mention is embedded inside another entity mention. Wojood consists of about 550K Modern Standard Arabic (MSA) and dialect tokens that are manually annotated with 21 entity types including person, organization, location, event and date. More importantly, the corpus is annotated with nested entities instead of the more common flat annotations. The data contains about 75K entities, 22.5% of which are nested. The inter-annotator evaluation of the corpus demonstrated a strong agreement with Cohen's Kappa of 0.979 and an F1-score of 0.976. To validate our data, we used the corpus to train a nested NER model based on multi-task learning and AraBERT (Arabic BERT). The model achieved an overall micro F1-score of 0.884. Our corpus, the annotation guidelines, the source code and the pre-trained model are publicly available.
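To make the notion of a nested entity concrete, the following sketch flags mentions whose token span lies inside another mention's span. The span representation (start, end-exclusive, label), the function name `nested_mentions`, and the example labels are assumptions for illustration only; they are not taken from the Wojood annotation format.

```python
def nested_mentions(spans):
    """Return the spans that are embedded inside a different, larger span.

    Each span is (start, end, label) over token positions, with `end`
    exclusive. Spans with identical boundaries are not treated as nested.
    """
    nested = []
    for i, (start, end, label) in enumerate(spans):
        for j, (o_start, o_end, _) in enumerate(spans):
            contained = o_start <= start and end <= o_end
            if i != j and contained and (o_start, o_end) != (start, end):
                nested.append((start, end, label))
                break
    return nested


# Hypothetical example: a two-token organization name whose first token
# is itself a place name, followed by an unrelated date mention.
spans = [(0, 2, "ORG"), (0, 1, "GPE"), (3, 4, "DATE")]
inner = nested_mentions(spans)
nested_share = len(inner) / len(spans)
```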

arXiv.org e-Print Archive